fix: real liveness probe on /health (Bug #8) by avrabe · Pull Request #54 · pulseengine/temper

avrabe · 2026-05-02T06:26:07Z

Summary

New `src/health.js` module with `evaluateHealth({scheduler, kv, dataDir})`.
Both `/health` endpoints (Express router + Probot v14 addHandler) now return 503 when any check fails.
Closes Bug #8 from `docs/agent-fleet/bugs.md`.

What's checked

| Probe | Signal | On fail |
| --- | --- | --- |
| scheduler | last tick no older than 2× interval | unhealthy → 503 |
| kv (SQLite) | `SELECT 1` succeeds | unhealthy → 503 |
| disk | data dir free bytes ≥ 100 MB | unhealthy → 503 |
| any probe throws | (probe itself broken) | degraded → 200 |
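
The three signals can be sketched as small predicates. This is only an illustration under stated assumptions, not the code in `src/health.js`: the helper names `schedulerFresh`, `kvAlive`, and `diskHasRoom` are hypothetical, "100 MB" is read as 100 × 1024 × 1024 bytes, and the disk stats are assumed to have the `bavail` / `bsize` shape that Node's `fs.statfsSync` returns.

```javascript
// Hypothetical helpers; names and argument shapes are assumptions,
// not the actual contents of src/health.js.

// Assumption: "100 MB" is read as 100 * 1024 * 1024 bytes.
const MIN_FREE_BYTES = 100 * 1024 * 1024;

// Scheduler is fresh while the last tick is at most 2x the interval old.
function schedulerFresh(lastTickAt, intervalMs, now = Date.now()) {
  return now - lastTickAt <= 2 * intervalMs;
}

// KV is alive when ping() (for example a `SELECT 1`) returns without throwing.
function kvAlive(ping) {
  try {
    ping();
    return true;
  } catch (err) {
    return false;
  }
}

// Disk has room when available blocks * block size meets the threshold.
// `stats` is assumed to look like the result of fs.statfsSync (bavail, bsize).
function diskHasRoom(stats, minFreeBytes = MIN_FREE_BYTES) {
  return stats.bavail * stats.bsize >= minFreeBytes;
}
```

Passing the clock, the ping function, and the stats object in as arguments keeps each predicate pure, which is one way to make the pass/fail branches easy to unit test.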

Why three states

`degraded` exists for "the probe itself broke" cases (e.g. `statfs` throws on a missing path). Returning 200 keeps PM2 from restarting (which wouldn't fix the underlying probe issue) while still flagging it visibly in the response body.
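
A minimal sketch of how the three states could map to HTTP status codes, assuming each check reports one of the strings `'pass'`, `'fail'`, or `'error'`. The function name `aggregate` and those result strings are illustrative assumptions; the real `evaluateHealth` may be shaped differently.

```javascript
// Hypothetical sketch of most-severe-wins aggregation; the result
// strings ('pass' | 'fail' | 'error') are assumptions.
function aggregate(checks) {
  const results = Object.values(checks);

  let state = 'healthy';
  // A broken probe is visible but not actionable.
  if (results.includes('error')) state = 'degraded';
  // A failed check outranks a broken probe.
  if (results.includes('fail')) state = 'unhealthy';

  return {
    state,
    // Only 'unhealthy' should trigger a restart, so only it maps to 503.
    httpStatus: state === 'unhealthy' ? 503 : 200,
    // Echo the per-check results so logs show which check tripped.
    checks,
  };
}
```

For example, `aggregate({ scheduler: 'pass', kv: 'fail', disk: 'error' })` yields `unhealthy` with status 503, because a failed check takes precedence over a probe that threw.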

Test plan

12 new unit tests in `tests/unit/health.test.js`
Full suite: 846 pass (was 834)
`npm run lint` clean
Live-test on netcup after deploy: kill the scheduler tick handler, verify /health returns 503

Risk

Medium — this is an intentional behaviour change. `/health` can now return non-200. External monitors that previously treated temper as always-healthy should be re-checked, but the whole point is that they should restart on real outages.

🤖 Generated with Claude Code

Why: DevOps/SRE flagged that /health returned 200 unconditionally. PM2's restart-on-failed-healthcheck and any external uptime monitor were both blind to a hung scheduler, a locked SQLite DB, or a full data disk. The bot could be silently dead for hours.

What:
- New module src/health.js: pure-function evaluateHealth(probes) with three checks (scheduler tick freshness, SQLite ping, disk free) and three states (healthy / degraded / unhealthy).
- Scheduler exposes getLastTickAt() and getIntervalMs(); tick timestamp is updated in finally so even error paths refresh it. /health flags the scheduler as 'fail' when the last tick is older than 2× the interval.
- persistent-kv exposes ping(), a SELECT 1 against the open Database. Throws on locked / broken file.
- app.js wires both /health endpoints (Express router + Probot v14 addHandler) through evaluateHealth and returns 503 when probe.ok=false. The body always includes a `checks` map so PM2 logs / dashboards can see *which* check tripped.

The three states:
- healthy: every check passed → 200
- degraded: a probe missing, threw, or scheduler had no tick yet; not actionable but visible → 200
- unhealthy: a check failed in a way that means the bot cannot do its job (DB ping throws, scheduler hung, disk full) → 503

Test plan:
- 12 new unit tests in __tests__/unit/health.test.js covering each probe's pass/fail/error/missing branches plus the most-severe-wins aggregation rule.
- Full suite: 846 pass (was 834), lint clean.
- Existing /health integration tests still pass: when probes are not injected (test setup leaves them null) evaluateHealth returns healthy so the 200 path is unchanged.

Risk: medium. The behaviour change is intentional: /health can now return 503. Any external monitor that currently treats temper as always-healthy should be re-checked, but the whole point is that they should now restart on real outages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
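
The "tick timestamp is updated in finally" behaviour described in the commit message could look roughly like the sketch below. Only `getLastTickAt()` and `getIntervalMs()` are named in this PR; `createScheduler`, `runTick`, and the injectable `now` clock are hypothetical, and the tick is kept synchronous here for brevity.

```javascript
// Hypothetical scheduler wrapper; only getLastTickAt() and getIntervalMs()
// are named in the PR, everything else is an assumption.
function createScheduler(tick, intervalMs, now = Date.now) {
  let lastTickAt = null; // null until the first tick has run

  function runTick() {
    try {
      tick();
    } catch (err) {
      // A real implementation would likely log the error here.
    } finally {
      // Runs on success and on error, so a throwing tick still counts as alive.
      lastTickAt = now();
    }
  }

  return {
    runTick,
    getLastTickAt: () => lastTickAt,
    getIntervalMs: () => intervalMs,
  };
}
```

With this shape, a tick handler that throws still refreshes the timestamp, so only a scheduler that stops ticking altogether ages past the 2× interval threshold.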

avrabe wants to merge 1 commit into main from fix/health-real-liveness
